{
  "cells": [
    {
      "cell_type": "markdown",
      "id": "a627af7f",
      "metadata": {
        "id": "a627af7f"
      },
      "source": [
        "# Qwen2.5-VL (7B) Vision Model"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "6ca9476a",
      "metadata": {
        "id": "6ca9476a"
      },
      "source": [
        "## Description\n",
        "This notebook demonstrates how to use the **Qwen2.5-VL (7B)** vision-language model for image input and text generation tasks."
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DhivyaBharathy-web/PraisonAI/blob/main/examples/cookbooks/Qwen2_5_VL_7B_Vision_Model.ipynb)"
      ],
      "metadata": {
        "id": "0G9UGDLJk8uV"
      },
      "id": "0G9UGDLJk8uV"
    },
    {
      "cell_type": "markdown",
      "id": "f467ebfe",
      "metadata": {
        "id": "f467ebfe"
      },
      "source": [
        "## Dependencies\n",
        "```python\n",
        "!pip install unsloth transformers accelerate torch torchvision datasets\n",
        "```"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "ea76265d",
      "metadata": {
        "id": "ea76265d"
      },
      "source": [
        "## Tools\n",
        "- 🤗 Transformers\n",
        "- PyTorch\n",
        "- Unsloth (`FastVisionModel`)\n",
        "- 🤗 Datasets"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "37c57a37",
      "metadata": {
        "id": "37c57a37"
      },
      "source": [
        "## YAML Prompt\n",
        "```yaml\n",
        "mode: vision\n",
        "model: Qwen2.5-VL (7B) Vision Model\n",
        "tasks:\n",
        "  - image captioning\n",
        "  - visual question answering\n",
        "```"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "7b02bc60",
      "metadata": {
        "id": "7b02bc60"
      },
      "source": [
        "## Main\n",
        "The cells below install Unsloth, load the 4-bit quantized Qwen2.5-VL (7B) model, attach LoRA adapters, and load the `unsloth/LaTeX_OCR` dataset."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "6c397c49",
      "metadata": {
        "id": "6c397c49"
      },
      "outputs": [],
      "source": [
        "%%capture\n",
        "import os\n",
        "if \"COLAB_\" not in \"\".join(os.environ.keys()):\n",
        "    !pip install unsloth\n",
        "else:\n",
        "    # Do this only in Colab notebooks! Otherwise use pip install unsloth\n",
        "    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo\n",
        "    !pip install sentencepiece protobuf \"datasets>=3.4.1\" huggingface_hub hf_transfer\n",
        "    !pip install --no-deps unsloth"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "18b3d67c",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 513,
          "referenced_widgets": [
            "73fc3ac47de840cda7a07297153404d5",
            "8c8b54f4d34a4b59a5a2513bbc082297",
            "7bd2abffca3f4193afffd3f1a7c8ff8a",
            "9ecdc7e0e9c247c59e9eab93455153b8",
            "84f4e9f7ff6f433bbf13078e43ebbe2a",
            "7475f8a404624b1b844d8148fabb493b",
            "9d4aae9ebe594bd6875a5963eb6d77ed",
            "62b6d78abada429393c499e0479eb56e",
            "2aecce21a3a04971b4f5c9fd0b059738",
            "2af66136423c41c6908c5a61c6fa2e5d",
            "8bf92078d48448848402ede49b1b5182",
            "78ae8aea29e04420bcfe616061934d98",
            "e22b1706b6be49af8e31d2c512ccaaf6",
            "71ac48ba667b4af291df3a33e7643c6c",
            "7d16f0e5aaec4429a5b8d3c4878ea74a",
            "a21de38381c241309bc798be604c3bf5",
            "f26f1de0f3f34010ac0d37d6f3ed74b6",
            "0af61827920e415daf64cfcb1685ee63",
            "77a82cf2255b43e088812087af4d8ead",
            "fad668efab014d28877087f38f3c034e",
            "e764888263e54f0ba2f603992d678096",
            "2dccbbbdfbf645b8b5896953f6e8a59b",
            "45385885a26e42d0b1f52eda382fd6f4",
            "ccb21e1b7fc14a6ab10689154ab2b94a",
            "f5c7ab8d65c4418a943a51a401720f67",
            "f6aaa74bce474abcbc9daf0515b6ffbd",
            "76bd63e567e94650a82272ca6fd85011",
            "d3cac12f194a40a187bbba68da36173c",
            "30439a7bf22f416cbc01cfe87e433e1f",
            "3265b3490b11426a8aa44d541981a296",
            "78a31b91439c4c99a07f8f8d8dfb548f",
            "d2a3a751ac2f491c9e7cb7a4092ab0b6",
            "b652bb727c1148c4aac9ed751920e3ae",
            "14e519d5e0834fd9a70a111140f8429a",
            "41f3770e6d864c1794a1fa93feaf8605",
            "6aa1670b808b42db894756b904243947",
            "5757f1d68b8f4571a05486d846127b14",
            "14cbf1b61170470bbb9d63e31d0ca310",
            "3324ce12dc3e45fab9984eb83a92eb39",
            "0ccabea3b39d4b148d420e26b53d3990",
            "2d65a3b83b524eeaafde8e406e90198f",
            "d179ef65b2b842d69548302c2f7a31be",
            "2362fd437a134c30b147df947ee97128",
            "7f22e31fc7174ff2bfa8bc2624d33dc7",
            "643db314d8e94feeb9cd8747909eaabe",
            "b80bb8f2cc92400aa34934978f9bc865",
            "03f0c63c572249aabb7fb61272af1f1b",
            "4113f045f9bc440b8dfaed5dfab14c1b",
            "86a231a75d4a4586a53e4c7e74c63550",
            "573653d0298246868e5d93069ca0fef5",
            "c20a42f12c4545dfafdba22267e44d4b",
            "4b74060c0bb44a35b847e6752b93e463",
            "7e211233455342d48bd60f6d499b91f5",
            "520ca077c6854e6fa1e019c0b209eae7",
            "8f1fce549e494998b13f523bd793d300",
            "bbb5e6a0a8b64d10802fbf3e89778306",
            "0cef8128eeba4638bd268f9a1a364427",
            "399cc918cec24db7b99af9e5f3c5adb4",
            "3a611a2e00be4da6833172d928802fb9",
            "0f2f303ae91349e29132055105fcbdec",
            "17c08a96e04244a995e658eeb5a69699",
            "9269dee0b0104b22b4043d3661f79350",
            "a405cbc6d42d4f8f8be7c05f4e34128c",
            "e702e948bb34481d995c75be223c4e16",
            "c9949e2b99d14fa0bc4cfc88f42c2afb",
            "48bbf9b662af47c5b99ebe24a6b70549",
            "c1af10a57c0a40cf818d0974facdcc36",
            "e4c1ff933c6e4ef49afbedffcdd99d5c",
            "3caf3a6d48da403099bcdee89bda6949",
            "7e0bbe2417e0420db1502390ba7c6c2d",
            "5522658c84dc47a491f5deb08353b631",
            "aecb331a01a44c9e92c74840faaf2616",
            "caadc7751ff847b4a831dd6b2b53e60b",
            "f0dd55c8f71a4f47a91a7018ab89d092",
            "7a2d1ba8441e4a368a505c38d2c811ad",
            "6657bbba902e43429224ed9c426389fe",
            "a25ba73fb99448a3b46ac3571e2db0be",
            "c2f518986b724620bbc67f7b6ede6d96",
            "ed214eea73f54928af276883ddd53b5e",
            "b241468a1d684fbb820baff6b9f9e362",
            "0f7da532813445f1be01977cdec364ab",
            "de7ac823a59d488995aa37bfbcec2e5b",
            "fc5321766b8742ca97c4a839ba30d421",
            "c4256f50e38e481083f3d366bfd91d3f",
            "f7e4e90fb3fe4bbb85adb039404dd8e2",
            "e0f8b5f130064bd69f4f3298e9536b5a",
            "20956dab8f3e4f22830b4d9e56a5232a",
            "4efe313f923c4ae585fb8a4353cecd97",
            "3af59a3e9ac2430da201262fc54b5998",
            "ad746cc40cdb461bba8cf04033d4afc5",
            "3d09770f22854ca38213d4a64ace6df3",
            "910df16ec7494eb68fb1f8ce95046414",
            "080c7eed9cc0475980e85a530c91e616",
            "e4fe961d149b461fb1d168d226c8aa1d",
            "2a588c12d3814de495fa865a63629ba8",
            "607a235508e54fdda0aef75311c7a213",
            "383e14f14b0d4328a80e6b0283c421fa",
            "913324d3476a4d04887d200f91afc553",
            "4c0c6892c4f347deb4a85544664ff1e1",
            "a863bb17b5984981bf9bd4ff93e438f4",
            "b28286deba7c4152a8d2c21ea0d19402",
            "fb9e17975b2349de81115173af6ac17a",
            "f3b068de4fa6466e88970b73c868de0a",
            "2127bee2928e43578b8b51e3c1fdfabf",
            "a97f3d8c7bdf4292ae8ed133f059f635",
            "67053bf18cd04347a0f2d2ff6134ae88",
            "80e56300aee14b4386832aa1618c5d43",
            "d096e33830754ea5b2e8f102548b805a",
            "a913ea3370404f5f972754715889aced",
            "35db2f7869f44aeab422409d27d5f7ec"
          ]
        },
        "id": "18b3d67c",
        "outputId": "f0d7f72f-8166-439b-96da-ed7e30f8bdb3"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.\n",
            "🦥 Unsloth Zoo will now patch everything to make training faster!\n",
            "==((====))==  Unsloth 2025.3.19: Fast Qwen2_5_Vl patching. Transformers: 4.50.0.\n",
            "   \\\\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.\n",
            "O^O/ \\_/ \\    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0\n",
            "\\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]\n",
            " \"-____-\"     Free license: http://github.com/unslothai/unsloth\n",
            "Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "73fc3ac47de840cda7a07297153404d5",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "model.safetensors:   0%|          | 0.00/5.97G [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "78ae8aea29e04420bcfe616061934d98",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "generation_config.json:   0%|          | 0.00/267 [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "45385885a26e42d0b1f52eda382fd6f4",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "preprocessor_config.json:   0%|          | 0.00/575 [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.50, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "14e519d5e0834fd9a70a111140f8429a",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "tokenizer_config.json:   0%|          | 0.00/7.33k [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "643db314d8e94feeb9cd8747909eaabe",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "bbb5e6a0a8b64d10802fbf3e89778306",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "c1af10a57c0a40cf818d0974facdcc36",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "c2f518986b724620bbc67f7b6ede6d96",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "added_tokens.json:   0%|          | 0.00/605 [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "3af59a3e9ac2430da201262fc54b5998",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "special_tokens_map.json:   0%|          | 0.00/614 [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "a863bb17b5984981bf9bd4ff93e438f4",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "chat_template.json:   0%|          | 0.00/1.05k [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "from unsloth import FastVisionModel # FastLanguageModel for LLMs\n",
        "import torch\n",
        "\n",
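        "# Pre-quantized 4-bit models we support: roughly 4x faster downloads and fewer OOM errors.\n",
        "# This list is for reference only; the model actually loaded is named in from_pretrained below.\n",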
        "fourbit_models = [\n",
        "    \"unsloth/Llama-3.2-11B-Vision-Instruct-bnb-4bit\", # Llama 3.2 vision support\n",
        "    \"unsloth/Llama-3.2-11B-Vision-bnb-4bit\",\n",
        "    \"unsloth/Llama-3.2-90B-Vision-Instruct-bnb-4bit\", # Can fit in a 80GB card!\n",
        "    \"unsloth/Llama-3.2-90B-Vision-bnb-4bit\",\n",
        "\n",
        "    \"unsloth/Pixtral-12B-2409-bnb-4bit\",              # Pixtral fits in 16GB!\n",
        "    \"unsloth/Pixtral-12B-Base-2409-bnb-4bit\",         # Pixtral base model\n",
        "\n",
        "    \"unsloth/Qwen2-VL-2B-Instruct-bnb-4bit\",          # Qwen2 VL support\n",
        "    \"unsloth/Qwen2-VL-7B-Instruct-bnb-4bit\",\n",
        "    \"unsloth/Qwen2-VL-72B-Instruct-bnb-4bit\",\n",
        "\n",
        "    \"unsloth/llava-v1.6-mistral-7b-hf-bnb-4bit\",      # Any Llava variant works!\n",
        "    \"unsloth/llava-1.5-7b-hf-bnb-4bit\",\n",
        "] # More models at https://huggingface.co/unsloth\n",
        "\n",
        "model, tokenizer = FastVisionModel.from_pretrained(\n",
        "    \"unsloth/Qwen2.5-VL-7B-Instruct-bnb-4bit\",\n",
        "    load_in_4bit = True, # Use 4bit to reduce memory use. False for 16bit LoRA.\n",
        "    use_gradient_checkpointing = \"unsloth\", # True or \"unsloth\" for long context\n",
        ")"
      ]
    },
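    {
      "cell_type": "markdown",
      "id": "mem-estimate-4bit",
      "metadata": {
        "id": "mem-estimate-4bit"
      },
      "source": [
        "The `load_in_4bit = True` comment above says 4-bit loading reduces memory use. A rough back-of-the-envelope sketch of why that matters on a 16 GB-class GPU such as the Tesla T4 follows. The helper `weight_memory_gb` and the nominal 7 billion parameter count are illustrative assumptions, not part of Unsloth; real usage is higher because some layers may stay in higher precision, and activations, quantization constants, and the CUDA context also take memory.\n",
        "\n",
        "```python\n",
        "def weight_memory_gb(n_params: float, bits_per_param: int) -> float:\n",
        "    \"\"\"Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring all overhead.\"\"\"\n",
        "    return n_params * bits_per_param / 8 / 1e9\n",
        "\n",
        "\n",
        "# Nominal parameter count taken from the \"7B\" in the model name (an assumption).\n",
        "N_PARAMS = 7e9\n",
        "\n",
        "for bits in (16, 4):\n",
        "    print(f\"{bits:>2}-bit weights: ~{weight_memory_gb(N_PARAMS, bits):.1f} GB\")\n",
        "```\n",
        "\n",
        "Under these assumptions the 16-bit weights alone (about 14 GB) would nearly fill the 14.741 GB reported for the T4 in the log above, while 4-bit weights need roughly a quarter of that."
      ]
    },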
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "45f1a7d8",
      "metadata": {
        "id": "45f1a7d8"
      },
      "outputs": [],
      "source": [
        "model = FastVisionModel.get_peft_model(\n",
        "    model,\n",
        "    finetune_vision_layers     = True, # False if not finetuning vision layers\n",
        "    finetune_language_layers   = True, # False if not finetuning language layers\n",
        "    finetune_attention_modules = True, # False if not finetuning attention layers\n",
        "    finetune_mlp_modules       = True, # False if not finetuning MLP layers\n",
        "\n",
        "    r = 16,           # LoRA rank: larger values add capacity but may overfit\n",
        "    lora_alpha = 16,  # Recommended: lora_alpha >= r\n",
        "    lora_dropout = 0,\n",
        "    bias = \"none\",\n",
        "    random_state = 3407,\n",
        "    use_rslora = False,  # We support rank stabilized LoRA\n",
        "    loftq_config = None, # And LoftQ\n",
        "    # target_modules = \"all-linear\", # Optional now! Can specify a list if needed\n",
        ")"
      ]
    },
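    {
      "cell_type": "markdown",
      "id": "lora-scaling-sketch",
      "metadata": {
        "id": "lora-scaling-sketch"
      },
      "source": [
        "The `r`, `lora_alpha`, and `use_rslora` arguments above interact through the LoRA scaling factor. As a minimal sketch, assuming the PEFT convention (scale `lora_alpha / r`, or `lora_alpha / sqrt(r)` with rank-stabilized LoRA), the helpers below show what the chosen values imply. The function names and the example layer shape are hypothetical and are not part of the Unsloth API.\n",
        "\n",
        "```python\n",
        "import math\n",
        "\n",
        "\n",
        "def lora_scaling(r: int, lora_alpha: int, use_rslora: bool = False) -> float:\n",
        "    \"\"\"Factor applied to the low-rank update B @ A before it is added to the frozen weight.\"\"\"\n",
        "    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r\n",
        "\n",
        "\n",
        "def lora_param_count(d_in: int, d_out: int, r: int) -> int:\n",
        "    \"\"\"Trainable parameters added to one d_in x d_out linear layer (A is r x d_in, B is d_out x r).\"\"\"\n",
        "    return r * (d_in + d_out)\n",
        "\n",
        "\n",
        "# Values from the get_peft_model call above.\n",
        "print(lora_scaling(r=16, lora_alpha=16))                   # standard LoRA\n",
        "print(lora_scaling(r=16, lora_alpha=16, use_rslora=True))  # rank-stabilized LoRA\n",
        "\n",
        "# Hypothetical 3584 x 3584 projection, chosen only for illustration.\n",
        "print(lora_param_count(3584, 3584, r=16))\n",
        "```\n",
        "\n",
        "With `r = 16` and `lora_alpha = 16` the standard scale is 1.0, so the adapter update is added unscaled; enabling `use_rslora` with the same values would raise the scale to 4.0."
      ]
    },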
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "915566b6",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 177,
          "referenced_widgets": [
            "5e4f3f4910d1477b924362c095a0cedd",
            "b5a01de096fa4e2daf66e235bd5686bb",
            "e9195f3a3e29475f941c302fe2a88a83",
            "0d6f6fb66522436786728d8602e2f466",
            "94b72e08764c486d8eed43a4557f1d98",
            "8e8281a56a3d4879b6b376077890c920",
            "630894e18b264f7dae8eddb17304a1cd",
            "697fc92abb824cf4a9bc78709640ba54",
            "eb6d552d6c1d4e82a521b29b36ae3db0",
            "7572f2d06114442ba94fc22761a55346",
            "c6dcd15e155949cbb9698e43e55b7a90",
            "7365368b63d9425db5d4e5514e6e0586",
            "5997909fc34645ce9a03d82ab0932766",
            "ea4ebb90b708493c85fa81fecf4d60c0",
            "227724312eb045a58616f63c3a5b7407",
            "4f58e8a2ddbb4a73a2421ea37c68fea9",
            "7e89e5546e394162b7b0f793478527ff",
            "db9ea237050045d8861d76bb3b995753",
            "3a484309bbe148abb2ef05452850728a",
            "5e601d00c0094d999a5b37b9555c111c",
            "e7bd6c674af541d08c0acb5e47579156",
            "927147d620b0426cb46b49b6d8dc2ffc",
            "4ebb36e479884847bcef5736e276dfb9",
            "7b5a059a8ef84817851d109fd7e3a7b5",
            "0e949ae97d544d65af09cebe63041b8e",
            "c817a779a9ff4a848391a7c7dff1e4ca",
            "abae462b98c84006b71d4a62c8a47d5a",
            "dc1a755aae6c44849d818fd96c046649",
            "2650d2ce196e404a98cbfac96d0a9bef",
            "36d579fdc2fb4c59b827f5bf7a92abd4",
            "fccec1c2a389401eb614f5f1cfb15f0c",
            "7b4d303966274c71b50312df12e7b87a",
            "ce4bb2dffcf84ae9bf1f8ac86b2d768e",
            "639b886f72e24951909e4ffb6cfa247d",
            "7b13cf1727a04f5ca6de1f7eb536d800",
            "631ce764f4684225a3c1e4455717a555",
            "48af93dada914d8b88d5ffffd24ac666",
            "4bd339c9d3ee43f39678d2c18baeef53",
            "b1b12b8def994e748b08b27abcef4b87",
            "bc38022049ce47628d8c215959c38b91",
            "331d5d2bc67f44f283fc82481cb60411",
            "0a5ad3bd338b4e6ea94a11a199d2f5ba",
            "5c41f695158641f485349c20a1405cfd",
            "89bc17f1911a43cca7196da232500fd6",
            "f5629e8c82c848218f69810e1b21b0e4",
            "5244452af0734f3183a7384435f9e1fe",
            "69f645decfb64baca11cfbdb9773a458",
            "d1788fabe3e34f6eb60fecab3c7490ba",
            "8f423066ba8f421495f934a269053cc1",
            "569d1435402e4533b0427a4b191b6754",
            "b834a7ab12544f9b8305f5728677e921",
            "e72ac7c9b7824c81a6e77e6fadad9f9c",
            "828ad6e5fcfb44b3b4559ec5682d095d",
            "dac611678f4949ae98051bdc975e083f",
            "0a21bae13a3c46509db70dbe8e9de5eb"
          ]
        },
        "id": "915566b6",
        "outputId": "ece64855-55d4-4011-c5b8-33a9bb622734"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "5e4f3f4910d1477b924362c095a0cedd",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "README.md:   0%|          | 0.00/519 [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "7365368b63d9425db5d4e5514e6e0586",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "train-00000-of-00001.parquet:   0%|          | 0.00/344M [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "4ebb36e479884847bcef5736e276dfb9",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "test-00000-of-00001.parquet:   0%|          | 0.00/38.2M [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "639b886f72e24951909e4ffb6cfa247d",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Generating train split:   0%|          | 0/68686 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "f5629e8c82c848218f69810e1b21b0e4",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Generating test split:   0%|          | 0/7632 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "from datasets import load_dataset\n",
        "dataset = load_dataset(\"unsloth/LaTeX_OCR\", split = \"train\")"
      ]
    },
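    {
      "cell_type": "markdown",
      "id": "inference-sketch",
      "metadata": {
        "id": "inference-sketch"
      },
      "source": [
        "With the model and dataset loaded, a single image can be passed through the model to generate text. The sketch below follows the pattern commonly used in Unsloth vision notebooks; it has not been run in this notebook, and the helper names `build_messages` and `describe_image`, the `max_new_tokens` default, and the use of a CUDA device are assumptions.\n",
        "\n",
        "```python\n",
        "def build_messages(instruction: str) -> list:\n",
        "    \"\"\"Build a single-turn chat with one image placeholder followed by the text instruction.\"\"\"\n",
        "    return [\n",
        "        {\n",
        "            \"role\": \"user\",\n",
        "            \"content\": [\n",
        "                {\"type\": \"image\"},\n",
        "                {\"type\": \"text\", \"text\": instruction},\n",
        "            ],\n",
        "        }\n",
        "    ]\n",
        "\n",
        "\n",
        "def describe_image(model, tokenizer, image, instruction: str, max_new_tokens: int = 128) -> str:\n",
        "    \"\"\"Generate text for one image. Assumes a CUDA GPU and the model/tokenizer loaded above.\"\"\"\n",
        "    # Imported lazily so the helper can be defined without Unsloth installed.\n",
        "    from unsloth import FastVisionModel\n",
        "\n",
        "    FastVisionModel.for_inference(model)  # switch Unsloth to inference mode\n",
        "    prompt = tokenizer.apply_chat_template(build_messages(instruction), add_generation_prompt=True)\n",
        "    inputs = tokenizer(images=image, text=prompt, add_special_tokens=False, return_tensors=\"pt\").to(\"cuda\")\n",
        "    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, use_cache=True)\n",
        "    # Drop the prompt tokens and decode only the newly generated ones.\n",
        "    new_tokens = output_ids[:, inputs[\"input_ids\"].shape[1]:]\n",
        "    return tokenizer.batch_decode(new_tokens, skip_special_tokens=True)[0]\n",
        "```\n",
        "\n",
        "For example, `describe_image(model, tokenizer, dataset[0][\"image\"], \"Write the LaTeX representation for this image.\")` would run it on the first training sample, assuming the dataset exposes an `image` column."
      ]
    },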
    {
      "cell_type": "markdown",
      "id": "6193435d",
      "metadata": {
        "id": "6193435d"
      },
      "source": [
        "## Output\n",
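        "Running the code cells above prints the Unsloth startup banner, downloads the model and tokenizer files, and generates the train (68,686 examples) and test (7,632 examples) splits of the `unsloth/LaTeX_OCR` dataset."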
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": []
    }
  },
  "nbformat": 4,
  "nbformat_minor": 5
}